This notebook explores the Life Expectancy dataset and builds a linear regression model to understand which factors influence life expectancy.
# Import necessary libraries.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# To enable plotting graphs in Jupyter notebook
%matplotlib inline
# Load the data into pandas dataframe
data = pd.read_csv('Life Expectancy Data.csv') # Make changes to the path depending on where your data file is stored.
data.head()
# Check number of rows and columns
data.shape
# Have a look at the column names
data.columns
# Check column types and missing values
data.info()
# Remove rows that contain missing values
data = data.dropna()
# Check the unique values in each column of the dataframe.
data.nunique()
# Scatter plots of life expectancy against several candidate predictors
for col in ['Schooling', 'Measles', 'Alcohol', 'Total expenditure', 'Adult Mortality']:
    plt.figure(figsize=(10,7))
    plt.scatter(data[col], data['Life expectancy'], color='red')
    plt.title(f'Life expectancy Vs {col}', fontsize=14)
    plt.xlabel(col, fontsize=14)
    plt.ylabel('Life expectancy', fontsize=14)
    plt.grid(True)
    plt.show()
plt.figure(figsize=(10,7))
plt.scatter(data['Life expectancy'][:200], data['Country'][:200], color='red')
plt.title('Country Vs Life expectancy', fontsize=14)
plt.xlabel('Life expectancy', fontsize=14)
plt.ylabel('Country', fontsize=14)
plt.grid(True)
plt.show()
sns.pairplot(data, height=3, diag_kind='auto', corner=True)
plt.show()
plt.figure(figsize=(10,10))
sns.boxplot(y=data['Life expectancy'])
plt.show()
plt.figure(figsize=(8,8))
sns.boxplot(x="Status",y="Life expectancy",data=data)
plt.show()
# Correlation of every numeric column with life expectancy
data.corr(numeric_only=True)['Life expectancy']
plt.figure(figsize=(20,20))
sns.heatmap(data.corr(numeric_only=True), annot=True, fmt=".2f")
plt.show()
X = data.drop('Life expectancy', axis=1)
y = data[['Life expectancy']]
print(X.head())
print(y.head())
print(X.shape)
print(y.shape)
# Encode the categorical columns as integer codes
X['Country'] = X['Country'].astype('category').cat.codes
X['Status'] = X['Status'].astype('category').cat.codes
X.head()
X = X.values
y = y.values
#split the data into train and test
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
from sklearn.linear_model import LinearRegression
linearregression = LinearRegression()
linearregression.fit(X_train, y_train)
print("Intercept of the linear equation:", linearregression.intercept_)
print("\nCoefficients of the equation are:", linearregression.coef_)
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
pred = linearregression.predict(X_test)
# Mean Absolute Error
mean_absolute_error(y_test, pred)
The mean absolute error (MAE) is the simplest regression error metric to understand. We calculate the residual for every data point, take the absolute value of each so that negative and positive residuals do not cancel out, and then average them. Effectively, MAE describes the typical magnitude of the residuals.
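As a sanity check, MAE can be computed by hand from the residuals. The sketch below uses small made-up arrays (not the notebook's actual predictions) and verifies the result against scikit-learn:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Small made-up actual/predicted values, just for illustration.
y_true = np.array([70.0, 65.0, 80.0, 72.0])
y_pred = np.array([68.0, 66.0, 77.0, 73.0])

# MAE: average of the absolute residuals.
residuals = y_true - y_pred
mae_manual = np.mean(np.abs(residuals))

assert np.isclose(mae_manual, mean_absolute_error(y_true, y_pred))
print(mae_manual)  # 1.75
```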
# RMSE
mean_squared_error(y_test, pred)**0.5
The root mean square error (RMSE) is like the MAE, but instead of taking absolute values it squares each residual before averaging, and then takes the square root of the result. Squaring penalizes large errors more heavily than MAE does.
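Likewise, RMSE can be reproduced in a few lines of NumPy; again, the arrays here are illustrative values, not the model's actual predictions:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Small made-up actual/predicted values, just for illustration.
y_true = np.array([70.0, 65.0, 80.0, 72.0])
y_pred = np.array([68.0, 66.0, 77.0, 73.0])

# RMSE: square the residuals, average them, then take the square root.
rmse_manual = np.sqrt(np.mean((y_true - y_pred) ** 2))

assert np.isclose(rmse_manual, mean_squared_error(y_true, y_pred) ** 0.5)
```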
# R2 score
r2_score(y_test, pred)
R^2 (coefficient of determination) regression score function.
The best possible score is 1.0, and it can be negative (because the model can be arbitrarily worse). A constant model that always predicts the expected value of y, disregarding the input features, would get an R^2 score of 0.0.
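The same hand-computation works for R^2, built from the residual and total sums of squares; the arrays below are made-up values for illustration:

```python
import numpy as np
from sklearn.metrics import r2_score

# Small made-up actual/predicted values, just for illustration.
y_true = np.array([70.0, 65.0, 80.0, 72.0])
y_pred = np.array([68.0, 66.0, 77.0, 73.0])

# R^2 = 1 - SS_res / SS_tot
ss_res = np.sum((y_true - y_pred) ** 2)           # residual sum of squares
ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)  # total sum of squares
r2_manual = 1 - ss_res / ss_tot

assert np.isclose(r2_manual, r2_score(y_true, y_pred))
```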
# Training Score
linearregression.score(X_train, y_train)
# Testing score
linearregression.score(X_test, y_test)
df = pd.DataFrame({'Actual': y_test.flatten(), 'Predicted': pred.flatten()})
df
We can also visualize the comparison as a bar graph using the script below.
Note: since the number of records is large, only the first 25 are plotted for readability.
df1 = df.head(25)
df1.plot(kind='bar',figsize=(16,10))
plt.grid(which='major', linestyle='-', linewidth='0.5', color='green')
plt.grid(which='minor', linestyle=':', linewidth='0.5', color='black')
plt.show()
# LinearRegression's normalize=True option was removed in scikit-learn 1.2;
# scaling the features with StandardScaler in a pipeline is the commonly
# recommended replacement.
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
linearregression_ = make_pipeline(StandardScaler(), LinearRegression())
linearregression_.fit(X_train, y_train)
print("Intercept of the linear equation:", linearregression_[-1].intercept_)
print("\nCoefficients of the equation are:", linearregression_[-1].coef_)
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
pred = linearregression_.predict(X_test)
# Mean Absolute Error
mean_absolute_error(y_test, pred)
# RMSE
mean_squared_error(y_test, pred)**0.5
# R2 score
r2_score(y_test, pred)
# Training Score
linearregression_.score(X_train, y_train)
# Testing score
linearregression_.score(X_test, y_test)
The training and testing scores are both around 83% and comparable, hence the model is a good fit.
An R2 score of 0.83 means the model explains 83% of the total variation in the dataset, so overall the model is satisfactory.
Scaling the features does change the coefficients and the intercept, but it does not affect the fitted line: the predictions and the scores are the same as before.
import statsmodels.api as sm
X = sm.add_constant(X)
linearmodel = sm.OLS(y, X).fit()
predictions = linearmodel.predict(X)
print_model = linearmodel.summary()
print(print_model)
Schooling coefficient: it represents the change in the output y for a one-unit increase in Schooling, everything else held constant.
P>|t|: the p-value for the null hypothesis that the corresponding coefficient is zero. Small values (conventionally below 0.05) suggest the predictor has a statistically significant relationship with the target.
# Plot residuals (actual - predicted) against the predicted values
plt.figure(figsize=(10,8))
plt.scatter(linearmodel.fittedvalues, linearmodel.resid, marker='*')
plt.xlabel('Predicted values')
plt.ylabel('Residuals')
plt.show()
# error distribution
plt.figure(figsize=(10,8))
sns.histplot(linearmodel.resid, kde=False, color='red')
plt.show()